DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers are needed to manually screen each submission before it is approved for posting on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve.
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
The train.csv data set provided by DonorsChoose contains the following features:
| Feature | Description |
|---|---|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. |
| project_grade_category | Grade level of students for which the project is targeted (one of a fixed set of enumerated values). |
| project_subject_categories | One or more (comma-separated) subject categories for the project, drawn from a fixed list of values. |
| school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
| project_resource_summary | An explanation of the resources needed for the project. |
| project_essay_1 | First application essay* |
| project_essay_2 | Second application essay* |
| project_essay_3 | Third application essay* |
| project_essay_4 | Fourth application essay* |
| project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title (one of a fixed set of enumerated values). |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |
* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
| Feature | Description |
|---|---|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |
Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so it can be used as a key to retrieve all the resources a project needs.
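As a sketch of that join (toy stand-ins for the two files; the values are made up), summing price and quantity per project and merging back onto the train rows might look like:

```python
import pandas as pd

# toy stand-ins for train.csv and resources.csv (values are made up)
train = pd.DataFrame({"id": ["p036502", "p039565"]})
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p039565"],
    "quantity": [3, 1, 2],
    "price": [9.95, 249.00, 14.50],
})

# aggregate all resources belonging to each project, then left-join onto train
per_project = resources.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()
merged = pd.merge(train, per_project, on="id", how="left")
print(merged)
```

The `how='left'` keeps every train row even if a project had no resource lines.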
The data set contains the following label (the value you will attempt to predict):
| Label | Description |
|---|---|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved; a value of 1 indicates it was approved. |
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm
import os
from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter
The response table is built only on the train dataset. For a category that appears in the test data but not in the train data, we encode it with default values. For example, if the test data contains a state D that was never seen in training, we encode it as [0.5, 0.05].
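The scheme above can be sketched as follows (toy categories and labels; the default pair for unseen categories is the one mentioned above):

```python
from collections import Counter

def build_response_table(categories, labels, default=(0.5, 0.05)):
    """Per-category (P(approved), P(not approved)), learned from train only."""
    total, pos = Counter(), Counter()
    for cat, y in zip(categories, labels):
        total[cat] += 1
        pos[cat] += y
    table = {c: (pos[c] / total[c], (total[c] - pos[c]) / total[c]) for c in total}
    # categories unseen in train fall back to the default encoding
    return lambda c: table.get(c, default)

encode = build_response_table(["A", "A", "B"], [1, 0, 1])
print(encode("A"))  # (0.5, 0.5)
print(encode("D"))  # (0.5, 0.05) -- default, since "D" never appeared in train
```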
or

project_data=pd.read_csv('train_data.csv', nrows=35000)
resource_data=pd.read_csv('resources.csv')
print("number of data points in train data", project_data.shape)
print('-'*50)
print("the attributes of data :", project_data.columns.values)
print("Number of data points in resource data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
price_data=resource_data.groupby('id').agg({'price':'sum','quantity':'sum'}).reset_index()
price_data.head(2)
# join two dataframes in python:
project_data=pd.merge(project_data, price_data, on='id', how='left')
project_data.head(2)
# check for the presence of numeric digits in a string: https://stackoverflow.com/a/19859308/8089731
def hasNumbers(inputString):
    return any(i.isdigit() for i in inputString)
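A quick sanity check of the helper on hypothetical strings:

```python
def hasNumbers(inputString):
    # True if any character in the string is a digit
    return any(i.isdigit() for i in inputString)

print(hasNumbers("Box of 25 reeds"))   # True
print(hasNumbers("Tenor Saxophone"))   # False
```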
p1 = project_data[['id', 'project_resource_summary']].copy()  # .copy() avoids SettingWithCopyWarning
p1.columns = ['id', 'digits_in_summary']
p1['digits_in_summary'] = p1['digits_in_summary'].map(hasNumbers)
# https://stackoverflow.com/a/17383325/8089731
p1['digits_in_summary'] = p1['digits_in_summary'].astype(int)
project_data=pd.merge(project_data,p1,on='id',how='left')
project_data.head(5)
categories=list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in categories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','):  # split it into parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split():  # split each category on spaces: "Math & Science" => "Math", "&", "Science"
            j = j.replace('The', '')  # drop the word 'The'
        j = j.replace(' ', '')  # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp += j.strip() + " "  # strip any remaining leading/trailing spaces
    temp = temp.replace('&', '_')  # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
project_data['clean_categories']=cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)
project_data.head(5)
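Wrapped into a standalone function (a restatement of the loop above), the cleaning steps behave like this on the sample string from the comments:

```python
def clean_category_text(text):
    """Clean a comma-separated category string: drop 'The', strip spaces, '&' -> '_'."""
    temp = ""
    for j in text.split(','):
        if 'The' in j.split():
            j = j.replace('The', '')
        j = j.replace(' ', '')
        temp += j.strip() + " "
    return temp.replace('&', '_').strip()

print(clean_category_text("Math & Science, Warmth, Care & Hunger"))
# Math_Science Warmth Care_Hunger
```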
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())
my_counter
# dict sort by value python: https://stackoverflow.com/a/613218/4084039
cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
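A minimal illustration of the sort-by-value idiom used here (toy counts):

```python
counts = {"Math_Science": 3, "Warmth": 1, "Literacy_Language": 5}
# sort the (key, value) pairs by count, ascending, and rebuild a dict
sorted_counts = dict(sorted(counts.items(), key=lambda kv: kv[1]))
print(list(sorted_counts))  # ['Warmth', 'Math_Science', 'Literacy_Language']
```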
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
sub_cat_list = []
for i in sub_catogories:
    temp = ""
    for j in i.split(','):  # split the comma-separated subcategories
        if 'The' in j.split():
            j = j.replace('The', '')  # drop the word 'The'
        j = j.replace(' ', '')  # remove all spaces
        temp += j.strip() + " "
    temp = temp.replace('&', '_')  # e.g. "Math&Science" => "Math_Science"
    sub_cat_list.append(temp.strip())
project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)
project_data.head(2)
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
from collections import Counter
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
# dict sort by value python: https://stackoverflow.com/a/613218/4084039
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))
# merge the four essay columns into a single text column,
# space-separated so words at the boundaries don't run together
project_data["essay"] = project_data["project_essay_1"].map(str) + " " + \
                        project_data["project_essay_2"].map(str) + " " + \
                        project_data["project_essay_3"].map(str) + " " + \
                        project_data["project_essay_4"].map(str)
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
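As a quick check of the expansion rules, the same substitutions applied to a sample sentence (the function is restated compactly so this snippet runs standalone):

```python
import re

def decontracted(phrase):
    # same substitution rules as above, restated so this snippet is self-contained
    for pat, rep in [(r"won't", "will not"), (r"can't", "can not"), (r"n't", " not"),
                     (r"'re", " are"), (r"'s", " is"), (r"'d", " would"),
                     (r"'ll", " will"), (r"'ve", " have"), (r"'m", " am")]:
        phrase = re.sub(pat, rep, phrase)
    return phrase

print(decontracted("they're excited and can't wait"))  # they are excited and can not wait
```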
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]
# Combining all the above statements
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentence in tqdm(project_data['essay'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    # lowercase before filtering so capitalized stop words ("The", "I") are removed too
    sent = ' '.join(e for e in sent.lower().split() if e not in stopwords)
    preprocessed_essays.append(sent.strip())
preprocessed_essays[2000]
from tqdm import tqdm
preprocessed_titles = []
# tqdm is for printing the status bar
for title in tqdm(project_data['project_title'].values):
    _title = decontracted(title)
    _title = _title.replace('\\r', ' ')
    _title = _title.replace('\\"', ' ')
    _title = _title.replace('\\n', ' ')
    _title = re.sub('[^A-Za-z0-9]+', ' ', _title)
    # https://gist.github.com/sebleier/554280
    # lowercase before filtering so capitalized stop words are removed too
    _title = ' '.join(e for e in _title.lower().split() if e not in stopwords)
    preprocessed_titles.append(_title.strip())
preprocessed_titles[2000]
project_grade_catogories = list(project_data['project_grade_category'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
project_grade_cat_list = []
for i in tqdm(project_grade_catogories):
    temp = ""
    for j in i.split(','):
        if 'The' in j.split():
            j = j.replace('The', '')
        j = j.replace(' ', '')  # remove spaces, e.g. "Grades PreK-2" => "GradesPreK-2"
        temp += j.strip() + " "
    temp = temp.replace('&', '_')
    project_grade_cat_list.append(temp.strip())
project_grade_cat_list[2000]
project_data['clean_project_grade_category'] = project_grade_cat_list
project_data.drop(['project_grade_category'], axis=1, inplace=True)
project_data.head()
project_data.drop(['project_essay_1','project_essay_2','project_essay_3','project_essay_4'], axis=1, inplace=True)
project_data.head()
project_data['preprocessed_essays'] = preprocessed_essays
project_data['preprocessed_titles'] = preprocessed_titles
# Replacing NaNs with the most frequent value: https://stackoverflow.com/a/51053916/8089731
most_common_prefix = project_data['teacher_prefix'].value_counts().idxmax()  # label of the mode
project_data['teacher_prefix'].fillna(most_common_prefix, inplace=True)
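The imputation idiom on a toy Series (values are made up): `value_counts().idxmax()` returns the label of the most frequent value, which then fills the missing entries.

```python
import pandas as pd

s = pd.Series(["Mrs.", "Ms.", "Mrs.", None, "Mr.", "Mrs."])
most_common = s.value_counts().idxmax()  # label of the most frequent value
filled = s.fillna(most_common)
print(most_common)          # Mrs.
print(filled.isna().sum())  # 0
```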
# School State
project_school_catogories = list(project_data['school_state'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
project_school_cat_list = []
for i in tqdm(project_school_catogories):
    temp = ""
    for j in i.split(','):  # state codes contain no commas, so this is a single pass
        if 'The' in j.split():
            j = j.replace('The', '')
        j = j.replace(' ', '')
        temp += j.strip() + " "
    temp = temp.replace('&', '_')
    project_school_cat_list.append(temp.strip())
project_data['clean_project_school_category'] = project_school_cat_list
project_data.drop(['school_state'], axis=1, inplace=True)
project_data.head(2)
from tqdm import tqdm
preprocessed_teacher_prefix = []
# tqdm is for printing the status bar
for prefix in tqdm(project_data['teacher_prefix'].values):
    _prefix = decontracted(prefix)
    _prefix = _prefix.replace('\\r', ' ')
    _prefix = _prefix.replace('\\"', ' ')
    _prefix = _prefix.replace('\\n', ' ')
    _prefix = re.sub('[^A-Za-z0-9]+', ' ', _prefix)
    # https://gist.github.com/sebleier/554280
    _prefix = ' '.join(e for e in _prefix.split() if e not in stopwords)
    preprocessed_teacher_prefix.append(_prefix.lower().strip())
preprocessed_teacher_prefix[20000]
project_data['clean_teacher_prefix_category'] = preprocessed_teacher_prefix
project_data.drop(['teacher_prefix'], axis=1, inplace=True)
project_data.head(2)
project_data['clean_teacher_prefix_category'][20000]
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very much helpful in debugging your code
# when you plot any graph make sure you use
# a. Title, that describes your plot, this will be very helpful to the reader
# b. Legends if needed
# c. X-axis label
# d. Y-axis label
project_data.columns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn import model_selection
X_train, X_test, y_train, y_test = train_test_split(project_data, project_data['project_is_approved'], test_size=0.33, stratify=project_data['project_is_approved'], random_state=42)  # fixed seed so the split is reproducible
#X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.33, stratify=y_train)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
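What `stratify` buys here can be seen on a toy label vector with roughly the same class imbalance as `project_is_approved` (the ~85% approval rate is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with roughly an 85/15 imbalance
y = np.array([1] * 85 + [0] * 15)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio (about 85% positive) in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```

Without stratification, a random split could leave the rare negative class under-represented in one side.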
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very much helpful in debugging your code
# make sure you featurize train and test data separately
# when you plot any graph make sure you use
# a. Title, that describes your plot, this will be very helpful to the reader
# b. Legends if needed
# c. X-axis label
# d. Y-axis label
Xtrain_pos = X_train.loc[X_train['project_is_approved'] == 1]
Xtrain_neg = X_train.loc[X_train['project_is_approved'] == 0]
clean_pos_cat = {}
for a in Xtrain_pos['clean_categories']:
    for b in a.split():
        if b not in clean_pos_cat:
            clean_pos_cat[b] = 1
        else:
            clean_pos_cat[b] += 1
clean_neg_cat = {}
for a in Xtrain_neg['clean_categories']:
    for b in a.split():
        if b not in clean_neg_cat:
            clean_neg_cat[b] = 1
        else:
            clean_neg_cat[b] += 1
clean_cat_xtrain = {}
for a in X_train['clean_categories']:
    for b in a.split():
        if b not in clean_cat_xtrain:
            clean_cat_xtrain[b] = 1
        else:
            clean_cat_xtrain[b] += 1
pos_cat_p = {}
for p in clean_cat_xtrain.keys():
    # .get guards against a category that appears in only one class
    pos_cat_p[p] = clean_pos_cat.get(p, 0) / float(clean_cat_xtrain[p])
neg_cat_n = {}
for n in clean_cat_xtrain.keys():
    neg_cat_n[n] = clean_neg_cat.get(n, 0) / float(clean_cat_xtrain[n])
cat_0n_xtrain = []
cat_1p_xtrain = []
for a in X_train["clean_categories"]:
    neg_prob = 1.0
    pos_prob = 1.0
    for token in a.split():  # multiply the per-category probabilities for multi-label rows
        neg_prob *= neg_cat_n[token]
        pos_prob *= pos_cat_p[token]
    cat_0n_xtrain.append(neg_prob)
    cat_1p_xtrain.append(pos_prob)
X_train["cat_0n"] = cat_0n_xtrain
X_train["cat_1p"] = cat_1p_xtrain
import math
cat_0n_xtest = []
cat_1p_xtest = []
for a in X_test["clean_categories"]:
    neg_prob = 1.0
    pos_prob = 1.0
    for token in a.split():
        # fall back to a default value for categories unseen in train
        neg_prob *= neg_cat_n.get(token, 0.5)
        pos_prob *= pos_cat_p.get(token, 0.5)
    cat_0n_xtest.append(neg_prob)
    cat_1p_xtest.append(pos_prob)
# sanity check: make sure no NaNs slipped in
flag = 0
for i in range(len(cat_0n_xtest)):
    if math.isnan(cat_0n_xtest[i]):
        flag = 1
print(flag)
X_test["cat_0n"] = cat_0n_xtest
X_test["cat_1p"] = cat_1p_xtest
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
cat_std0n = StandardScaler()
cat_std0n.fit(X_train['cat_0n'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# Now standardize the data with the above mean and variance.
cat_0n_xtrain = cat_std0n.transform(X_train['cat_0n'].values.reshape(-1, 1))
cat_0n_xtest = cat_std0n.transform(X_test['cat_0n'].values.reshape(-1, 1))
print(cat_0n_xtrain.shape)
print(cat_0n_xtest.shape)
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
cat_std1p = StandardScaler()
cat_std1p.fit(X_train['cat_1p'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# Now standardize the data with the above mean and variance.
cat_1p_xtrain = cat_std1p.transform(X_train['cat_1p'].values.reshape(-1, 1))
cat_1p_xtest = cat_std1p.transform(X_test['cat_1p'].values.reshape(-1, 1))
print(cat_1p_xtrain.shape)
print(cat_1p_xtest.shape)
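The fit-on-train / transform-both pattern used throughout this section reduces to the following arithmetic (toy numbers are arbitrary; this mirrors what StandardScaler does internally):

```python
import numpy as np

train_vals = np.array([1.0, 2.0, 3.0, 4.0])
test_vals = np.array([2.0, 5.0])

# statistics come from train only, so no information leaks from test
mu, sigma = train_vals.mean(), train_vals.std()
train_std = (train_vals - mu) / sigma
test_std = (test_vals - mu) / sigma  # reuse the train statistics
print(mu, round(sigma, 4))  # 2.5 1.118
```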
clean_pos_sub_cat = {}
for a in Xtrain_pos['clean_subcategories']:
    for b in a.split():
        if b not in clean_pos_sub_cat:
            clean_pos_sub_cat[b] = 1
        else:
            clean_pos_sub_cat[b] += 1
clean_neg_sub_cat = {}
for a in Xtrain_neg['clean_subcategories']:
    for b in a.split():
        if b not in clean_neg_sub_cat:
            clean_neg_sub_cat[b] = 1
        else:
            clean_neg_sub_cat[b] += 1
clean_sub_cat_xtrain = {}
for a in X_train['clean_subcategories']:
    for b in a.split():
        if b not in clean_sub_cat_xtrain:
            clean_sub_cat_xtrain[b] = 1
        else:
            clean_sub_cat_xtrain[b] += 1
pos_sub_cat_p = {}
for p in clean_sub_cat_xtrain.keys():
    # .get guards against a subcategory that appears in only one class
    pos_sub_cat_p[p] = clean_pos_sub_cat.get(p, 0) / float(clean_sub_cat_xtrain[p])
neg_sub_cat_n = {}
for n in clean_sub_cat_xtrain.keys():
    neg_sub_cat_n[n] = clean_neg_sub_cat.get(n, 0) / float(clean_sub_cat_xtrain[n])
sub_cat_0n_xtrain = []
sub_cat_1p_xtrain = []
for a in X_train['clean_subcategories']:
    neg_prob = 1.0
    pos_prob = 1.0
    for token in a.split():  # multiply the per-subcategory probabilities for multi-label rows
        neg_prob *= neg_sub_cat_n[token]
        pos_prob *= pos_sub_cat_p[token]
    sub_cat_0n_xtrain.append(neg_prob)
    sub_cat_1p_xtrain.append(pos_prob)
X_train["sub_cat_0n"] = sub_cat_0n_xtrain
X_train["sub_cat_1p"] = sub_cat_1p_xtrain
import math
sub_cat_0n_xtest = []
sub_cat_1p_xtest = []
for a in X_test['clean_subcategories']:
    neg_prob = 1.0
    pos_prob = 1.0
    for token in a.split():
        # fall back to a default value for subcategories unseen in train
        neg_prob *= neg_sub_cat_n.get(token, 0.5)
        pos_prob *= pos_sub_cat_p.get(token, 0.5)
    sub_cat_0n_xtest.append(neg_prob)
    sub_cat_1p_xtest.append(pos_prob)
flag = 0
for i in range(len(sub_cat_0n_xtest)):
    if math.isnan(sub_cat_0n_xtest[i]):
        flag = 1
print(flag)
# assign the subcategory features (not the category ones)
X_test["sub_cat_0n"] = sub_cat_0n_xtest
X_test["sub_cat_1p"] = sub_cat_1p_xtest
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
sub_cat_std0n = StandardScaler()
sub_cat_std0n.fit(X_train['sub_cat_0n'].values.reshape(-1,1)) # learn the mean and standard deviation on train
# Now standardize the data with the above mean and variance.
sub_cat_0n_xtrain = sub_cat_std0n.transform(X_train['sub_cat_0n'].values.reshape(-1, 1))
sub_cat_0n_xtest = sub_cat_std0n.transform(X_test['sub_cat_0n'].values.reshape(-1, 1))
print(sub_cat_0n_xtrain.shape)
print(sub_cat_0n_xtest.shape)
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
sub_cat_std1p = StandardScaler()
sub_cat_std1p.fit(X_train['sub_cat_1p'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# Now standardize the data with the above mean and variance.
sub_cat_1p_xtrain = sub_cat_std1p.transform(X_train['sub_cat_1p'].values.reshape(-1, 1))
sub_cat_1p_xtest = sub_cat_std1p.transform(X_test['sub_cat_1p'].values.reshape(-1, 1))
print(sub_cat_1p_xtrain.shape)
print(sub_cat_1p_xtest.shape)
school_state_pos = {}
for a in Xtrain_pos['clean_project_school_category']:
    if a not in school_state_pos:
        school_state_pos[a] = 1
    else:
        school_state_pos[a] += 1
school_state_neg = {}
for a in Xtrain_neg['clean_project_school_category']:
    if a not in school_state_neg:
        school_state_neg[a] = 1
    else:
        school_state_neg[a] += 1
school_state_xtrain = {}
for a in X_train['clean_project_school_category']:
    if a not in school_state_xtrain:
        school_state_xtrain[a] = 1
    else:
        school_state_xtrain[a] += 1
pos_school_state_p = {}
for state in school_state_xtrain.keys():
    # .get guards against a state that appears in only one class
    pos_school_state_p[state] = school_state_pos.get(state, 0) / float(school_state_xtrain[state])
neg_school_state_n = {}
for state in school_state_xtrain.keys():
    neg_school_state_n[state] = school_state_neg.get(state, 0) / float(school_state_xtrain[state])
school_state_0n_xtrain = []
school_state_1p_xtrain = []
for a in X_train['clean_project_school_category']:
    school_state_0n_xtrain.append(neg_school_state_n[a])
    school_state_1p_xtrain.append(pos_school_state_p[a])
X_train["school_state_0n"] = school_state_0n_xtrain
X_train["school_state_1p"] = school_state_1p_xtrain
school_state_0n_xtest = []
school_state_1p_xtest = []
for a in X_test['clean_project_school_category']:
    # fall back to a default value for states unseen in train
    school_state_0n_xtest.append(neg_school_state_n.get(a, 0.5))
    school_state_1p_xtest.append(pos_school_state_p.get(a, 0.5))
flag = 0
for i in range(len(school_state_0n_xtest)):
    if math.isnan(school_state_0n_xtest[i]):
        flag = 1
print(flag)
X_test["school_state_0n"] = school_state_0n_xtest
X_test["school_state_1p"] = school_state_1p_xtest
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
school_state_std0 = StandardScaler()
school_state_std0.fit(X_train["school_state_0n"].values.reshape(-1,1)) # finding the mean and standard deviation of this data
school_state_0n_xtrain = school_state_std0.transform(X_train["school_state_0n"].values.reshape(-1, 1))
school_state_0n_xtest = school_state_std0.transform(X_test["school_state_0n"].values.reshape(-1, 1))
print(school_state_0n_xtrain.shape)
print(school_state_0n_xtest.shape)
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
school_state_std1 = StandardScaler()
school_state_std1.fit(X_train["school_state_1p"].values.reshape(-1,1)) # finding the mean and standard deviation of this data
school_state_1p_xtrain = school_state_std1.transform(X_train["school_state_1p"].values.reshape(-1, 1))
school_state_1p_xtest = school_state_std1.transform(X_test["school_state_1p"].values.reshape(-1, 1))
print(school_state_1p_xtrain.shape)
print(school_state_1p_xtest.shape)
teacher_prefix_pos = {}
for a in Xtrain_pos['clean_teacher_prefix_category']:
    if a not in teacher_prefix_pos:
        teacher_prefix_pos[a] = 1
    else:
        teacher_prefix_pos[a] += 1
teacher_prefix_neg = {}
for a in Xtrain_neg['clean_teacher_prefix_category']:
    if a not in teacher_prefix_neg:
        teacher_prefix_neg[a] = 1
    else:
        teacher_prefix_neg[a] += 1
teacher_prefix_xtrain = {}
for a in X_train['clean_teacher_prefix_category']:
    if a not in teacher_prefix_xtrain:
        teacher_prefix_xtrain[a] = 1
    else:
        teacher_prefix_xtrain[a] += 1
pos_teacher_pref = {}
for p in teacher_prefix_xtrain.keys():
    # .get guards against a prefix that appears in only one class
    pos_teacher_pref[p] = teacher_prefix_pos.get(p, 0) / float(teacher_prefix_xtrain[p])
neg_teacher_pref = {}
for n in teacher_prefix_xtrain.keys():
    neg_teacher_pref[n] = teacher_prefix_neg.get(n, 0) / float(teacher_prefix_xtrain[n])
teacher_pref_0n_xtrain = []
teacher_pref_1p_xtrain = []
for a in X_train['clean_teacher_prefix_category']:
    teacher_pref_0n_xtrain.append(neg_teacher_pref[a])
    teacher_pref_1p_xtrain.append(pos_teacher_pref[a])
X_train["teacher_prefix_0n"] = teacher_pref_0n_xtrain
X_train["teacher_prefix_1p"] = teacher_pref_1p_xtrain
teacher_pref_0n_xtest = []
teacher_pref_1p_xtest = []
for a in X_test['clean_teacher_prefix_category']:
    # fall back to a default value for prefixes unseen in train
    teacher_pref_0n_xtest.append(neg_teacher_pref.get(a, 0.5))
    teacher_pref_1p_xtest.append(pos_teacher_pref.get(a, 0.5))
flag = 0
for i in range(len(teacher_pref_0n_xtest)):
    if math.isnan(teacher_pref_0n_xtest[i]):
        flag = 1
print(flag)
X_test["teacher_prefix_0n"] = teacher_pref_0n_xtest
X_test["teacher_prefix_1p"] = teacher_pref_1p_xtest
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
teacher_pref_std0 = StandardScaler()
teacher_pref_std0.fit(X_train['teacher_prefix_0n'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
teacher_prefix_0n_xtrain = teacher_pref_std0.transform(X_train['teacher_prefix_0n'].values.reshape(-1, 1))
teacher_prefix_0n_xtest = teacher_pref_std0.transform(X_test['teacher_prefix_0n'].values.reshape(-1, 1))
print(teacher_prefix_0n_xtrain.shape)
print(teacher_prefix_0n_xtest.shape)
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
teacher_pref_std1 = StandardScaler()
teacher_pref_std1.fit(X_train['teacher_prefix_1p'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
teacher_prefix_1p_xtrain = teacher_pref_std1.transform(X_train['teacher_prefix_1p'].values.reshape(-1, 1))
teacher_prefix_1p_xtest = teacher_pref_std1.transform(X_test['teacher_prefix_1p'].values.reshape(-1, 1))
print(teacher_prefix_1p_xtrain.shape)
print(teacher_prefix_1p_xtest.shape)
proj_grade_pos = {}
for a in Xtrain_pos['clean_project_grade_category']:
    if a not in proj_grade_pos:
        proj_grade_pos[a] = 1
    else:
        proj_grade_pos[a] += 1
proj_grade_neg = {}
for a in Xtrain_neg['clean_project_grade_category']:
    if a not in proj_grade_neg:
        proj_grade_neg[a] = 1
    else:
        proj_grade_neg[a] += 1
proj_grade_xtrain = {}
for a in X_train['clean_project_grade_category']:
    if a not in proj_grade_xtrain:
        proj_grade_xtrain[a] = 1
    else:
        proj_grade_xtrain[a] += 1
pos_proj_grade = {}
for p in proj_grade_xtrain.keys():
    # .get guards against a grade category that appears in only one class
    pos_proj_grade[p] = proj_grade_pos.get(p, 0) / float(proj_grade_xtrain[p])
neg_proj_grade = {}
for n in proj_grade_xtrain.keys():
    neg_proj_grade[n] = proj_grade_neg.get(n, 0) / float(proj_grade_xtrain[n])
proj_grade_0n_xtrain = []
proj_grade_1p_xtrain = []
for a in X_train["clean_project_grade_category"]:
    proj_grade_0n_xtrain.append(neg_proj_grade[a])
    proj_grade_1p_xtrain.append(pos_proj_grade[a])
X_train["proj_grade_0n"] = proj_grade_0n_xtrain
X_train["proj_grade_1p"] = proj_grade_1p_xtrain
proj_grade_0n_xtest = []
proj_grade_1p_xtest = []
for a in X_test["clean_project_grade_category"]:
    # fall back to a default value for grade categories unseen in train
    proj_grade_0n_xtest.append(neg_proj_grade.get(a, 0.5))
    proj_grade_1p_xtest.append(pos_proj_grade.get(a, 0.5))
flag = 0
for i in range(len(proj_grade_0n_xtest)):
    if math.isnan(proj_grade_0n_xtest[i]):
        flag = 1
print(flag)
X_test["proj_grade_0n"] = proj_grade_0n_xtest
X_test["proj_grade_1p"] = proj_grade_1p_xtest
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
proj_grade_std0 = StandardScaler()
proj_grade_std0.fit(X_train['proj_grade_0n'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
proj_grade_0n_xtrain = proj_grade_std0.transform(X_train['proj_grade_0n'].values.reshape(-1, 1))
proj_grade_0n_xtest = proj_grade_std0.transform(X_test['proj_grade_0n'].values.reshape(-1, 1))
print(proj_grade_0n_xtrain.shape)
print(proj_grade_0n_xtest.shape)
#https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
proj_grade_std1 = StandardScaler()
proj_grade_std1.fit(X_train['proj_grade_1p'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
proj_grade_1p_xtrain = proj_grade_std1.transform(X_train['proj_grade_1p'].values.reshape(-1, 1))
proj_grade_1p_xtest = proj_grade_std1.transform(X_test['proj_grade_1p'].values.reshape(-1, 1))
print(proj_grade_1p_xtrain.shape)
print(proj_grade_1p_xtest.shape)
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
# price_standardized = standardScalar.fit(project_data['price'].values)
# would raise:
# ValueError: Expected 2D array, got 1D array instead: array=[725.05 213.03 329. ... 399. 287.73 5.5 ].
# Reshape your data using array.reshape(-1, 1)
price_scalar = StandardScaler()
price_scalar.fit(X_train['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")
# Now standardize the data with the above mean and variance.
price_standardized_xtrain = price_scalar.transform(X_train['price'].values.reshape(-1, 1))
#price_standardized_xcv = price_scalar.transform(X_cv['price'].values.reshape(-1, 1))
price_standardized_xtest = price_scalar.transform(X_test['price'].values.reshape(-1, 1))
print("shape of price_standardized_xtrain",price_standardized_xtrain.shape)
#print("shape of price_standardized_xcv",price_standardized_xcv.shape)
print("shape of price_standardized_xtest",price_standardized_xtest.shape)
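Fitting the scaler on the training split only, then applying it to both splits, avoids test-set leakage. A minimal sketch with made-up prices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up prices for illustration
train_prices = np.array([100.0, 200.0, 300.0]).reshape(-1, 1)
test_prices = np.array([200.0, 400.0]).reshape(-1, 1)

scaler = StandardScaler()
scaler.fit(train_prices)  # mean and std are estimated from the train split only
scaled_test = scaler.transform(test_prices)
print(scaler.mean_[0])    # 200.0
```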
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
quantity_scalar = StandardScaler()
quantity_scalar.fit(X_train['quantity'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# print(f"Mean : {quantity_scalar.mean_[0]}, Standard deviation : {np.sqrt(quantity_scalar.var_[0])}")
# Now standardize the data with the above mean and variance.
quantity_standardized_xtrain = quantity_scalar.transform(X_train['quantity'].values.reshape(-1, 1))
#quantity_standardized_xcv = quantity_scalar.transform(X_cv['quantity'].values.reshape(-1, 1))
quantity_standardized_xtest = quantity_scalar.transform(X_test['quantity'].values.reshape(-1, 1))
print("shape of quantity_standardized_xtrain",quantity_standardized_xtrain.shape)
#print("shape of quantity_standardized_xcv",quantity_standardized_xcv.shape)
print("shape of quantity_standardized_xtest",quantity_standardized_xtest.shape)
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler
teacher_num_prev_projects_scalar = StandardScaler()
teacher_num_prev_projects_scalar.fit(X_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
# print(f"Mean : {teacher_number_of_previously_posted_projects_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_number_of_previously_posted_projects_scalar.var_[0])}")
# Now standardize the data with the above mean and variance.
teacher_num_prev_projects_standardized_xtrain = teacher_num_prev_projects_scalar.transform(X_train['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
#teacher_num_prev_projects_standardized_xcv = teacher_num_prev_projects_scalar.transform(X_cv['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
teacher_num_prev_projects_standardized_xtest = teacher_num_prev_projects_scalar.transform(X_test['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
print(" shape of teacher_number_of_previously_posted_projects_standardized_xtrain",teacher_num_prev_projects_standardized_xtrain.shape)
#print(" shape of teacher_number_of_previously_posted_projects_standardized_xcv",teacher_num_prev_projects_standardized_xcv.shape)
print(" shape of teacher_number_of_previously_posted_projects_standardized_xtest",teacher_num_prev_projects_standardized_xtest.shape)
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentation and blogs before you start coding
# first figure out what to do, then think about how to do it
# reading and understanding error messages will be very helpful in debugging your code
# make sure you featurize train and test data separately
# when you plot any graph make sure you use:
# a. a title that describes your plot; this will be very helpful to the reader
# b. legends, if needed
# c. an X-axis label
# d. a Y-axis label
# BoW on essays
# We consider only words that appear in at least 10 documents (rows/projects).
vectorizer_bow_essays = CountVectorizer(min_df=10,max_features=5000,ngram_range=(1,2))
vectorizer_bow_essays.fit(X_train['preprocessed_essays'])
essay_text_bow_xtrain = vectorizer_bow_essays.transform(X_train['preprocessed_essays'])
#essay_text_bow_xcv = vectorizer_bow_essays.transform(X_cv['preprocessed_essays'])
essay_text_bow_xtest = vectorizer_bow_essays.transform(X_test['preprocessed_essays'])
print("Shape of matrix after BOW_text_essay X_train ",essay_text_bow_xtrain.shape)
#print("Shape of matrix after BOW_text_essay X_cv ",essay_text_bow_xcv.shape)
print("Shape of matrix after BOW_text_essay X_test ",essay_text_bow_xtest.shape)
# BoW on project_title
# We consider only words that appear in at least 10 documents (rows/projects).
vectorizer_bow_titles = CountVectorizer(min_df=10)
vectorizer_bow_titles.fit(X_train['preprocessed_titles'])
proj_title_bow_xtrain = vectorizer_bow_titles.transform(X_train['preprocessed_titles'])
#proj_title_bow_xcv = vectorizer_bow_titles.transform(X_cv['preprocessed_titles'])
proj_title_bow_xtest = vectorizer_bow_titles.transform(X_test['preprocessed_titles'])
print("Shape of matrix after BOW project_title_xtrain ",proj_title_bow_xtrain.shape)
#print("Shape of matrix after BOW project_title_xcv ",proj_title_bow_xcv.shape)
print("Shape of matrix after BOW project_title_xtest ",proj_title_bow_xtest.shape)
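The effect of `min_df` can be seen on a toy corpus (made up for illustration): terms occurring in fewer than `min_df` documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "students need books",
    "students need laptops",
    "books for students",
]
# min_df=2 keeps only terms that occur in at least 2 documents,
# so 'laptops' and 'for' (1 document each) are dropped
vectorizer = CountVectorizer(min_df=2)
bow = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))   # ['books', 'need', 'students']
print(bow.shape)                        # (3, 3)
```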
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tfidf_essays = TfidfVectorizer(min_df=10,max_features=5000,ngram_range=(1,2))
vectorizer_tfidf_essays.fit(X_train['preprocessed_essays'])
essay_tfidf_xtrain = vectorizer_tfidf_essays.transform(X_train['preprocessed_essays'])
#essay_tfidf_xcv = vectorizer_tfidf_essays.transform(X_cv['preprocessed_essays'])
essay_tfidf_xtest = vectorizer_tfidf_essays.transform(X_test['preprocessed_essays'])
print("Shape of matrix after tfidf essay_xtrain ",essay_tfidf_xtrain.shape)
#print("Shape of matrix after tfidf essay_xcv ",essay_tfidf_xcv.shape)
print("Shape of matrix after tfidf essay_xtest ",essay_tfidf_xtest.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tfidf_title = TfidfVectorizer(min_df=10)
vectorizer_tfidf_title.fit(X_train['preprocessed_titles'])
proj_title_tfidf_xtrain = vectorizer_tfidf_title.transform(X_train['preprocessed_titles'])
#proj_title_tfidf_xcv = vectorizer_tfidf_title.transform(X_cv['preprocessed_titles'])
proj_title_tfidf_xtest = vectorizer_tfidf_title.transform(X_test['preprocessed_titles'])
print("Shape of matrix after tfidf proj_title_xtrain ",proj_title_tfidf_xtrain.shape)
#print("Shape of matrix after tfidf proj_title_xcv ",proj_title_tfidf_xcv.shape)
print("Shape of matrix after tfidf proj_title_xtest ",proj_title_tfidf_xtest.shape)
# Using Pretrained Models: Avg W2V
# storing variables in pickle files: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
glove_words = set(model.keys())
# average Word2Vec
# compute the average word2vec vector for each essay
# average Word2Vec on X_train
essay_avg_w2v_vectors_xtrain = []  # the avg-w2v vector of each essay is stored in this list
for sentence in tqdm(X_train['preprocessed_essays']):  # for each essay
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    cnt_words = 0  # number of words in the essay with a valid GloVe vector
    for word in sentence.split():  # for each word in the essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    essay_avg_w2v_vectors_xtrain.append(vector)
print(len(essay_avg_w2v_vectors_xtrain))
print(len(essay_avg_w2v_vectors_xtrain[0]))
# average Word2Vec on X_test
essay_avg_w2v_vectors_xtest = []  # the avg-w2v vector of each essay is stored in this list
for sentence in tqdm(X_test['preprocessed_essays']):  # for each essay
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    cnt_words = 0  # number of words in the essay with a valid GloVe vector
    for word in sentence.split():  # for each word in the essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    essay_avg_w2v_vectors_xtest.append(vector)
print(len(essay_avg_w2v_vectors_xtest))
print(len(essay_avg_w2v_vectors_xtest[0]))
# average Word2Vec
# compute the average word2vec vector for each project title
# average Word2Vec on X_train
proj_title_avg_w2v_vectors_xtrain = []  # the avg-w2v vector of each title is stored in this list
for sentence in tqdm(X_train['preprocessed_titles']):  # for each title
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    cnt_words = 0  # number of words in the title with a valid GloVe vector
    for word in sentence.split():  # for each word in the title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    proj_title_avg_w2v_vectors_xtrain.append(vector)
print(len(proj_title_avg_w2v_vectors_xtrain))
print(len(proj_title_avg_w2v_vectors_xtrain[0]))
# average Word2Vec on X_test
proj_title_avg_w2v_vectors_xtest = []  # the avg-w2v vector of each title is stored in this list
for sentence in tqdm(X_test['preprocessed_titles']):  # for each title
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    cnt_words = 0  # number of words in the title with a valid GloVe vector
    for word in sentence.split():  # for each word in the title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    proj_title_avg_w2v_vectors_xtest.append(vector)
print(len(proj_title_avg_w2v_vectors_xtest))
print(len(proj_title_avg_w2v_vectors_xtest[0]))
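The averaging loops above can be factored into one small function. The tiny embedding table below is made up for illustration (real GloVe vectors are 300-dimensional):

```python
import numpy as np

def avg_w2v(sentence, embeddings, dim):
    """Average the embedding vectors of the words in `sentence` that have one."""
    vector = np.zeros(dim)
    cnt = 0
    for word in sentence.split():
        if word in embeddings:
            vector += embeddings[word]
            cnt += 1
    return vector / cnt if cnt else vector

toy = {"books": np.array([1.0, 0.0]), "students": np.array([0.0, 1.0])}
v = avg_w2v("students need books", toy, dim=2)
# 'need' has no vector, so v averages the other two words: [0.5, 0.5]
```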
# fit a TF-IDF model on the training essays
tfidf_model = TfidfVectorizer()
tfidf_model.fit(X_train['preprocessed_essays'])
# build a dictionary mapping each word to its idf value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())
# TF-IDF weighted Word2Vec
# TF-IDF weighted Word2Vec on X_train
essay_tfidf_w2v_vectors_xtrain = []  # the tfidf-w2v vector of each essay is stored in this list
for sentence in tqdm(X_train['preprocessed_essays']):  # for each essay
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    tf_idf_weight = 0  # running sum of the tf-idf weights in the essay
    words = sentence.split()
    for word in words:  # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # the GloVe vector for this word
            # tf-idf = idf (dictionary[word]) * tf (token count / total tokens);
            # counting whole tokens here, not substrings
            tf_idf = dictionary[word] * (words.count(word) / len(words))
            vector += (vec * tf_idf)  # tf-idf weighted sum of word vectors
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    essay_tfidf_w2v_vectors_xtrain.append(vector)
print(len(essay_tfidf_w2v_vectors_xtrain))
print(len(essay_tfidf_w2v_vectors_xtrain[0]))
# TF-IDF weighted Word2Vec on X_test
essay_tfidf_w2v_vectors_xtest = []  # the tfidf-w2v vector of each essay is stored in this list
for sentence in tqdm(X_test['preprocessed_essays']):  # for each essay
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    tf_idf_weight = 0  # running sum of the tf-idf weights in the essay
    words = sentence.split()
    for word in words:  # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # the GloVe vector for this word
            # tf-idf = idf (dictionary[word]) * tf (token count / total tokens)
            tf_idf = dictionary[word] * (words.count(word) / len(words))
            vector += (vec * tf_idf)  # tf-idf weighted sum of word vectors
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    essay_tfidf_w2v_vectors_xtest.append(vector)
print(len(essay_tfidf_w2v_vectors_xtest))
print(len(essay_tfidf_w2v_vectors_xtest[0]))
# fit a TF-IDF model on the training titles
tfidf_model = TfidfVectorizer()
tfidf_model.fit(X_train['preprocessed_titles'])
# build a dictionary mapping each word to its idf value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())
# TFIDF weighted W2V on Project Title
# TFIDF weighted W2V on X_train
proj_title_tfidf_w2v_vectors_xtrain = []  # the tfidf-w2v vector of each title is stored in this list
for sentence in tqdm(X_train['preprocessed_titles']):  # for each title
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    tf_idf_weight = 0  # running sum of the tf-idf weights in the title
    words = sentence.split()
    for word in words:  # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # the GloVe vector for this word
            # tf-idf = idf (dictionary[word]) * tf (token count / total tokens)
            tf_idf = dictionary[word] * (words.count(word) / len(words))
            vector += (vec * tf_idf)  # tf-idf weighted sum of word vectors
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    proj_title_tfidf_w2v_vectors_xtrain.append(vector)
print(len(proj_title_tfidf_w2v_vectors_xtrain))
print(len(proj_title_tfidf_w2v_vectors_xtrain[0]))
# TFIDF weighted W2V on X_test
proj_title_tfidf_w2v_vectors_xtest = []  # the tfidf-w2v vector of each title is stored in this list
for sentence in tqdm(X_test['preprocessed_titles']):  # for each title
    vector = np.zeros(300)  # GloVe vectors are 300-dimensional
    tf_idf_weight = 0  # running sum of the tf-idf weights in the title
    words = sentence.split()
    for word in words:  # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # the GloVe vector for this word
            # tf-idf = idf (dictionary[word]) * tf (token count / total tokens)
            tf_idf = dictionary[word] * (words.count(word) / len(words))
            vector += (vec * tf_idf)  # tf-idf weighted sum of word vectors
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    proj_title_tfidf_w2v_vectors_xtest.append(vector)
print(len(proj_title_tfidf_w2v_vectors_xtest))
print(len(proj_title_tfidf_w2v_vectors_xtest[0]))
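The tf-idf weighted averaging can likewise be factored into one function. The embeddings and idf values below are made up for illustration:

```python
import numpy as np

def tfidf_w2v(sentence, embeddings, idf, dim):
    """TF-IDF-weighted average of the word vectors in `sentence`."""
    words = sentence.split()
    vector = np.zeros(dim)
    weight_sum = 0.0
    for word in words:
        if word in embeddings and word in idf:
            tf = words.count(word) / len(words)
            w = idf[word] * tf
            vector += embeddings[word] * w
            weight_sum += w
    return vector / weight_sum if weight_sum else vector

toy_vecs = {"books": np.array([1.0, 0.0]), "students": np.array([0.0, 1.0])}
toy_idf = {"books": 2.0, "students": 1.0}
v = tfidf_w2v("students books", toy_vecs, toy_idf, dim=2)
# weights: books 2.0*0.5 = 1.0, students 1.0*0.5 = 0.5
# v = ([1,0]*1.0 + [0,1]*0.5) / 1.5 = [2/3, 1/3]
```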
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_train1=hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain, quantity_standardized_xtrain,
essay_text_bow_xtrain, proj_title_bow_xtrain)).tocsr()
X_test1=hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_text_bow_xtest, proj_title_bow_xtest)).tocsr()
print(X_train1.shape, y_train.shape)
print(X_test1.shape, y_test.shape)
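`scipy.sparse.hstack` accepts a mix of sparse matrices and dense column arrays, which is what makes the stacking above work. A minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

bow = csr_matrix(np.array([[1, 0], [0, 2]]))   # e.g. a small BoW block
price = np.array([[0.5], [-0.5]])              # e.g. a standardized numeric column
X = hstack((bow, price)).tocsr()               # columns are concatenated side by side
print(X.shape)        # (2, 3)
print(X.toarray())
```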
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
start_time = time.time()
rfclf1 = RandomForestClassifier(n_jobs=-1,class_weight='balanced')
parameters = {'n_estimators': [10, 300, 500, 700], 'max_depth':[10, 30, 60, 100]}
clf1 = GridSearchCV(rfclf1, parameters, cv= 5, scoring='roc_auc',return_train_score=True)
clf1.fit(X_train1, y_train)
train_auc= clf1.cv_results_['mean_train_score']
train_auc_std= clf1.cv_results_['std_train_score']
cv_auc = clf1.cv_results_['mean_test_score']
cv_auc_std= clf1.cv_results_['std_test_score']
print("Total Execution time: " + str(time.time() - start_time) + ' seconds')
## grid search took roughly 3 hours (8:15 - 11:17)
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc, annot=True)
# ParameterGrid sorts parameter names alphabetically, so after reshape(4, 4)
# rows correspond to max_depth and columns to n_estimators
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean train AUC (BoW features)')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc, annot=True)
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean CV AUC (BoW features)')
plt.show()
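The `reshape(4, 4)` relies on the order in which `GridSearchCV` enumerates parameter combinations: `ParameterGrid` sorts parameter names alphabetically and varies the last one fastest, so with this grid `max_depth` indexes the rows and `n_estimators` the columns. A quick check:

```python
from sklearn.model_selection import ParameterGrid

grid = list(ParameterGrid({'n_estimators': [10, 300, 500, 700],
                           'max_depth': [10, 30, 60, 100]}))
print(grid[0])   # {'max_depth': 10, 'n_estimators': 10}
print(grid[1])   # {'max_depth': 10, 'n_estimators': 300} -> n_estimators varies fastest
```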
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the 2nd parameter should be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if data has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate in chunks of 1000 rows up to the last multiple of 1000
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
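On a small synthetic dataset, chunked prediction should agree with a single `predict_proba` call (the function is repeated here so the sketch is self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def batch_predict(clf, data):
    """Predict positive-class probabilities in chunks of 1000 rows."""
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred

rng = np.random.RandomState(0)
X = rng.randn(2500, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

chunked = batch_predict(clf, X)          # 2 full chunks + 500 leftover rows
full = clf.predict_proba(X)[:, 1]
print(np.allclose(chunked, full))        # True
```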
from sklearn.metrics import roc_curve, auc
modelbow = RandomForestClassifier(max_depth = 10, n_estimators = 500,n_jobs=-1,class_weight='balanced')
modelbow.fit(X_train1, y_train)
y_train_pred = batch_predict(modelbow, X_train1)
y_test_pred = batch_predict(modelbow, X_test1)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate(FPR)")
plt.ylabel("True Positive Rate(TPR)")
plt.title("ROC curves (BoW features)")
plt.grid()
plt.show()
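`roc_curve` and `auc` work as in this small example (adapted from the scikit-learn documentation):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # 0.75
```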
# custom predict function using a chosen probability threshold
# we pick the threshold that maximizes tpr*(1-fpr), i.e. a high tpr with a low fpr
def predict(proba, thresholds, fpr, tpr):
    t = thresholds[np.argmax(tpr*(1-fpr))]
    # tpr*(1-fpr) is maximal when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr*(1-fpr)), "at threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
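The threshold choice can be traced on a few made-up ROC points:

```python
import numpy as np

# made-up ROC points for illustration
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.7, 0.9, 1.0])
thresholds = np.array([1.8, 0.8, 0.4, 0.1])

score = tpr * (1 - fpr)          # [0.0, 0.63, 0.54, 0.0]
best = thresholds[np.argmax(score)]
print(best)   # 0.8 (tpr=0.7, fpr=0.1 gives the best trade-off here)
```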
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train, predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
# Confusion matrix for train data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtrain = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtrain, annot=True,annot_kws={"size": 16}, fmt='g')# font size
# Confusion matrix for test data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtest = pd.DataFrame(confusion_matrix(y_test[:], predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtest, annot=True,annot_kws={"size": 16}, fmt='g')#font size
import dill
#dill.dump_session('notebook_4_11.db')
dill.load_session('notebook_4_11.db')
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_train2=hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain,
quantity_standardized_xtrain,essay_tfidf_xtrain, proj_title_tfidf_xtrain)).tocsr()
X_test2=hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_tfidf_xtest, proj_title_tfidf_xtest)).tocsr()
print(X_train2.shape)
print(X_test2.shape)
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
start_time = time.time()
rfclf2 = RandomForestClassifier(n_jobs=-1,class_weight='balanced')
parameters = {'n_estimators': [10, 300, 500, 700], 'max_depth':[10, 30, 60, 100]}
clf2 = GridSearchCV(rfclf2, parameters, cv= 5, scoring='roc_auc',return_train_score=True)
clf2.fit(X_train2, y_train)
train_auc= clf2.cv_results_['mean_train_score']
train_auc_std= clf2.cv_results_['std_train_score']
cv_auc = clf2.cv_results_['mean_test_score']
cv_auc_std= clf2.cv_results_['std_test_score']
print("Total Execution time: " + str(time.time() - start_time) + ' seconds')
import dill
#dill.dump_session('notebook_44_11.db')
dill.load_session('notebook_44_11.db')
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc, annot=True)
# rows correspond to max_depth and columns to n_estimators after reshape(4, 4)
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean train AUC (TF-IDF features)')
plt.show()
cv_auc =cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc, annot=True)
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean CV AUC (TF-IDF features)')
plt.show()
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the 2nd parameter should be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if data has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate in chunks of 1000 rows up to the last multiple of 1000
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
modeltfidf = RandomForestClassifier(max_depth = 10, n_estimators = 100,n_jobs=-1,class_weight='balanced')
modeltfidf.fit(X_train2, y_train)
y_train_pred = batch_predict(modeltfidf, X_train2)
y_test_pred = batch_predict(modeltfidf, X_test2)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate(FPR)")
plt.ylabel("True Positive Rate(TPR)")
plt.title("ROC curves (TF-IDF features)")
plt.grid()
plt.show()
# custom predict function using a chosen probability threshold
# we pick the threshold that maximizes tpr*(1-fpr), i.e. a high tpr with a low fpr
def predict(proba, thresholds, fpr, tpr):
    t = thresholds[np.argmax(tpr*(1-fpr))]
    # tpr*(1-fpr) is maximal when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr*(1-fpr)), "at threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test[:], predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
# Confusion Matrix for Train Data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtrain = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtrain, annot=True,annot_kws={"size": 16}, fmt='g')#font size
# Confusion matrix for test data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtest = pd.DataFrame(confusion_matrix(y_test[:], predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtest, annot=True,annot_kws={"size": 16}, fmt='g')#font size
essay_avg_w2v_vectors_xtrain= np.array(essay_avg_w2v_vectors_xtrain)
proj_title_avg_w2v_vectors_xtrain= np.array(proj_title_avg_w2v_vectors_xtrain)
essay_avg_w2v_vectors_xtest= np.array(essay_avg_w2v_vectors_xtest)
proj_title_avg_w2v_vectors_xtest= np.array(proj_title_avg_w2v_vectors_xtest)
# all of these feature blocks are dense numpy arrays, so np.hstack is used here
# instead of scipy.sparse.hstack
X_train3=np.hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain,
essay_avg_w2v_vectors_xtrain, proj_title_avg_w2v_vectors_xtrain))
X_test3=np.hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest,
essay_avg_w2v_vectors_xtest, proj_title_avg_w2v_vectors_xtest))
print(X_train3.shape, y_train.shape)
print(X_test3.shape, y_test.shape)
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
start_time = time.time()
rfclf3 = RandomForestClassifier(n_jobs=-1,class_weight='balanced')
parameters = {'n_estimators': [10, 300, 500, 700], 'max_depth':[10, 30, 60, 100]}
clf3 = GridSearchCV(rfclf3, parameters, cv= 5, scoring='roc_auc',return_train_score=True)
clf3.fit(X_train3, y_train)
train_auc= clf3.cv_results_['mean_train_score']
train_auc_std= clf3.cv_results_['std_train_score']
cv_auc = clf3.cv_results_['mean_test_score']
cv_auc_std= clf3.cv_results_['std_test_score']
print("Total Execution time: " + str(time.time() - start_time) + ' seconds')
# grid search took roughly 7.5 hours (9:53am - 5:28pm)
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc, annot=True)
# rows correspond to max_depth and columns to n_estimators after reshape(4, 4)
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean train AUC (Avg W2V features)')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc, annot=True)
plt.xticks(np.arange(4), [10, 300, 500, 700])
plt.yticks(np.arange(4), [10, 30, 60, 100])
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.title('Mean CV AUC (Avg W2V features)')
plt.show()
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the 2nd parameter should be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if data has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate in chunks of 1000 rows up to the last multiple of 1000
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
modelavgw2v = RandomForestClassifier(max_depth = 10, n_estimators = 10,n_jobs=-1,class_weight='balanced')
modelavgw2v.fit(X_train3, y_train)
y_train_pred = batch_predict(modelavgw2v, X_train3)
y_test_pred = batch_predict(modelavgw2v, X_test3)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
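The threshold rule above can be illustrated on a tiny hand-made example (labels and scores are hypothetical): for perfectly separable scores, tpr*(1-fpr) peaks at 1 and the selected threshold reproduces the true labels.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1])                 # toy ground truth
scores = np.array([0.10, 0.35, 0.40, 0.65, 0.80, 0.90])  # toy probabilities
fpr, tpr, thresholds = roc_curve(y_true, scores)
t = thresholds[np.argmax(tpr * (1 - fpr))]            # best TPR at lowest FPR
labels = (scores >= t).astype(int)                    # thresholded predictions
```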
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
# Confusion Matrix for Train Data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtrain = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtrain, annot=True,annot_kws={"size": 16}, fmt='g')
# Confusion matrix for test data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtest = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtest, annot=True,annot_kws={"size": 16}, fmt='g')
essay_tfidf_w2v_vectors_xtrain=np.array(essay_tfidf_w2v_vectors_xtrain)
proj_title_tfidf_w2v_vectors_xtrain=np.array(proj_title_tfidf_w2v_vectors_xtrain)
essay_tfidf_w2v_vectors_xtest=np.array(essay_tfidf_w2v_vectors_xtest)
proj_title_tfidf_w2v_vectors_xtest=np.array(proj_title_tfidf_w2v_vectors_xtest)
# the TF-IDF-weighted Word2Vec features are dense NumPy arrays, so np.hstack is
# used below; scipy.sparse.hstack is only needed when merging sparse matrices
# (https://stackoverflow.com/a/19710648/4084039)
X_train4=np.hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain, quantity_standardized_xtrain,
essay_tfidf_w2v_vectors_xtrain, proj_title_tfidf_w2v_vectors_xtrain))
X_test4=np.hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_tfidf_w2v_vectors_xtest, proj_title_tfidf_w2v_vectors_xtest))
print(X_train4.shape, y_train.shape)
print(X_test4.shape, y_test.shape)
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
start_time = time.time()
rfclf4 = RandomForestClassifier(n_jobs=-1,class_weight='balanced')
parameters = {'n_estimators': [5, 10, 50, 100], 'max_depth':[2, 5, 7, 10]}
clf4 = GridSearchCV(rfclf4, parameters, cv= 5, scoring='roc_auc',return_train_score=True)
clf4.fit(X_train4, y_train)
train_auc= clf4.cv_results_['mean_train_score']
train_auc_std= clf4.cv_results_['std_train_score']
cv_auc = clf4.cv_results_['mean_test_score']
cv_auc_std= clf4.cv_results_['std_test_score']
print("Total Execution time: " + str(time.time() - start_time) + ' seconds')
# Testing the performance of the model on test data, plotting ROC Curves
# Select the best hyperparameters found by the grid search
best_set_tfidfw2v = clf4.best_params_
print(best_set_tfidfw2v)
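The reshape(4,4) calls below rely on the order in which GridSearchCV enumerates parameter combinations: keys are taken in sorted order ('max_depth' before 'n_estimators'), with the last key varying fastest, so rows of the reshaped grid index max_depth and columns index n_estimators. A sketch with a reduced, illustrative grid:

```python
from sklearn.model_selection import ParameterGrid

# same parameter names as the searches in this notebook, smaller value lists
grid = ParameterGrid({'n_estimators': [5, 10], 'max_depth': [2, 7]})
combos = [(p['max_depth'], p['n_estimators']) for p in grid]
# n_estimators (the alphabetically last key) varies fastest, so a reshape
# puts max_depth on the rows and n_estimators on the columns
```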
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the second argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if X_tr has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate over the full batches of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full batch
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
modeltfidfw2v = RandomForestClassifier(max_depth = 7, n_estimators = 100,n_jobs=-1,class_weight='balanced')
modeltfidfw2v.fit(X_train4, y_train)
y_train_pred = batch_predict(modeltfidfw2v, X_train4)
y_test_pred = batch_predict(modeltfidfw2v, X_test4)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
# Confusion Matrix for Train Data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtrain = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtrain, annot=True,annot_kws={"size": 16}, fmt='g')
# Confusion matrix for test data
# Code for this segment from here -->> https://stackoverflow.com/questions/35572000/how-can-i-plot-a-confusion-matrix
conf_matrix_xtest = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matrix_xtest, annot=True,annot_kws={"size": 16}, fmt='g')
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_train1=hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain, quantity_standardized_xtrain,
essay_text_bow_xtrain, proj_title_bow_xtrain)).tocsr()
X_test1=hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_text_bow_xtest, proj_title_bow_xtest)).tocsr()
print(X_train1.shape, y_train.shape)
print(X_test1.shape, y_test.shape)
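A minimal sketch of the sparse-merge pattern used above (toy matrices, hypothetical shapes): scipy.sparse.hstack concatenates blocks column-wise, and tocsr() makes the result efficiently row-sliceable, which the batch_predict loop relies on.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

one_hot = csr_matrix(np.eye(3))           # 3 x 3 sparse block (e.g. one-hot features)
extra = csr_matrix(np.ones((3, 2)))       # 3 x 2 sparse block (e.g. numeric features)
merged = hstack((one_hot, extra)).tocsr() # columns concatenate: 3 x 5, CSR for row slicing
```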
import dill
dill.dump_session('notebook_6_11.db')
#dill.load_session('notebook_env11.db')
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import time
start_time = time.time()
gbdt1 = xgb.XGBClassifier(n_jobs=-1)  # class_weight is not an XGBoost parameter; use scale_pos_weight for imbalance
parameters = {'n_estimators': [5, 10, 50, 100], 'max_depth':[2, 5, 7, 10]}
clfgbdt1 = GridSearchCV(gbdt1, parameters, cv= 3, scoring='roc_auc',return_train_score=True)
clfgbdt1.fit(X_train1, y_train)
train_auc= clfgbdt1.cv_results_['mean_train_score']
train_auc_std= clfgbdt1.cv_results_['std_train_score']
cv_auc = clfgbdt1.cv_results_['mean_test_score']
cv_auc_std= clfgbdt1.cv_results_['std_test_score']
print("Execution time: " + str(time.time() - start_time) + ' seconds')
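Since XGBClassifier does not accept scikit-learn's class_weight argument, the documented knob for class imbalance is scale_pos_weight, usually set to the negative/positive count ratio. A sketch with hypothetical toy labels:

```python
import numpy as np

y_toy = np.array([1, 1, 1, 0, 1, 1, 0])  # hypothetical labels: 5 approved, 2 rejected
neg = (y_toy == 0).sum()
pos = (y_toy == 1).sum()
scale_pos_weight = neg / pos             # weight applied to the positive class
```

The resulting value would be passed as `xgb.XGBClassifier(scale_pos_weight=scale_pos_weight, ...)`; for DonorsChoose, where approvals are the majority class, this ratio is below 1.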
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
# Testing the performance of the model on test data, plotting ROC Curves
# Select the best hyperparameters found by the grid search
best_set1_xgb = clfgbdt1.best_params_
print(best_set1_xgb)
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the second argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if X_tr has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate over the full batches of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full batch
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
from sklearn.metrics import roc_curve, auc
gbdt1 = xgb.XGBClassifier(max_depth=5, n_estimators=100, n_jobs=-1)  # class_weight is not an XGBoost parameter
gbdt1.fit(X_train1, y_train)
y_train_pred = batch_predict(gbdt1, X_train1)
y_test_pred = batch_predict(gbdt1, X_test1)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
conf_matr_df_train = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_train, annot=True,annot_kws={"size": 16}, fmt='g')
conf_matr_df_test = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_test, annot=True,annot_kws={"size": 16}, fmt='g')
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_train2=hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain,
quantity_standardized_xtrain,essay_tfidf_xtrain, proj_title_tfidf_xtrain)).tocsr()
X_test2=hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_tfidf_xtest, proj_title_tfidf_xtest)).tocsr()
print(X_train2.shape)
print(X_test2.shape)
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import time
start_time = time.time()
gbdt2 = xgb.XGBClassifier(n_jobs=-1)  # class_weight is not an XGBoost parameter; use scale_pos_weight for imbalance
parameters = {'n_estimators': [5, 10, 50, 100], 'max_depth':[2, 5, 7, 10]}
clfgbdt2 = GridSearchCV(gbdt2, parameters, cv= 3, scoring='roc_auc',return_train_score=True)
clfgbdt2.fit(X_train2, y_train)
train_auc= clfgbdt2.cv_results_['mean_train_score']
train_auc_std= clfgbdt2.cv_results_['std_train_score']
cv_auc = clfgbdt2.cv_results_['mean_test_score']
cv_auc_std= clfgbdt2.cv_results_['std_test_score']
print("Execution time: " + str(time.time() - start_time) + ' seconds')
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the second argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if X_tr has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate over the full batches of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full batch
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
from sklearn.metrics import roc_curve, auc
gbdt2 = xgb.XGBClassifier(max_depth=5, n_estimators=100, n_jobs=-1)  # class_weight is not an XGBoost parameter
gbdt2.fit(X_train2, y_train)
y_train_pred = batch_predict(gbdt2, X_train2)
y_test_pred = batch_predict(gbdt2, X_test2)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
conf_matr_df_train = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_train, annot=True,annot_kws={"size": 16}, fmt='g')
conf_matr_df_test = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_test, annot=True,annot_kws={"size": 16}, fmt='g')
essay_avg_w2v_vectors_xtrain= np.array(essay_avg_w2v_vectors_xtrain)
proj_title_avg_w2v_vectors_xtrain= np.array(proj_title_avg_w2v_vectors_xtrain)
essay_avg_w2v_vectors_xtest= np.array(essay_avg_w2v_vectors_xtest)
proj_title_avg_w2v_vectors_xtest= np.array(proj_title_avg_w2v_vectors_xtest)
# the average Word2Vec features are dense NumPy arrays, so np.hstack is used
# below; scipy.sparse.hstack is only needed when merging sparse matrices
# (https://stackoverflow.com/a/19710648/4084039)
X_train3=np.hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain,
essay_avg_w2v_vectors_xtrain, proj_title_avg_w2v_vectors_xtrain))
X_test3=np.hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest,
essay_avg_w2v_vectors_xtest, proj_title_avg_w2v_vectors_xtest))
print(X_train3.shape, y_train.shape)
print(X_test3.shape, y_test.shape)
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import time
start_time = time.time()
gbdt3 = xgb.XGBClassifier(n_jobs=-1)  # class_weight is not an XGBoost parameter; use scale_pos_weight for imbalance
parameters = {'n_estimators': [5, 10, 50, 100], 'max_depth':[2, 5, 7, 10]}
clfgbdt3 = GridSearchCV(gbdt3, parameters, cv= 3, scoring='roc_auc',return_train_score=True)
clfgbdt3.fit(X_train3, y_train)
train_auc= clfgbdt3.cv_results_['mean_train_score']
train_auc_std= clfgbdt3.cv_results_['std_train_score']
cv_auc = clfgbdt3.cv_results_['mean_test_score']
cv_auc_std= clfgbdt3.cv_results_['std_test_score']
print("Execution time: " + str(time.time() - start_time) + ' seconds')
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
# Testing the performance of the model on test data, plotting ROC Curves
# Select the best hyperparameters found by the grid search
best_set3_xgb = clfgbdt3.best_params_
print(best_set3_xgb)
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the second argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if X_tr has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate over the full batches of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full batch
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
from sklearn.metrics import roc_curve, auc
gbdt3 = xgb.XGBClassifier(max_depth=5, n_estimators=50, n_jobs=-1)  # class_weight is not an XGBoost parameter
gbdt3.fit(X_train3, y_train)
y_train_pred = batch_predict(gbdt3, X_train3)
y_test_pred = batch_predict(gbdt3, X_test3)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
conf_matr_df_train = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_train, annot=True,annot_kws={"size": 16}, fmt='g')
conf_matr_df_test = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_test, annot=True,annot_kws={"size": 16}, fmt='g')
essay_tfidf_w2v_vectors_xtrain=np.array(essay_tfidf_w2v_vectors_xtrain)
proj_title_tfidf_w2v_vectors_xtrain=np.array(proj_title_tfidf_w2v_vectors_xtrain)
essay_tfidf_w2v_vectors_xtest=np.array(essay_tfidf_w2v_vectors_xtest)
proj_title_tfidf_w2v_vectors_xtest=np.array(proj_title_tfidf_w2v_vectors_xtest)
# the TF-IDF-weighted Word2Vec features are dense NumPy arrays, so np.hstack is
# used below; scipy.sparse.hstack is only needed when merging sparse matrices
# (https://stackoverflow.com/a/19710648/4084039)
X_train4=np.hstack((cat_0n_xtrain, cat_1p_xtrain, sub_cat_0n_xtrain, sub_cat_1p_xtrain, school_state_0n_xtrain,
school_state_1p_xtrain, teacher_prefix_0n_xtrain, teacher_prefix_1p_xtrain,
proj_grade_0n_xtrain, proj_grade_1p_xtrain, price_standardized_xtrain,
teacher_num_prev_projects_standardized_xtrain, quantity_standardized_xtrain,
essay_tfidf_w2v_vectors_xtrain, proj_title_tfidf_w2v_vectors_xtrain))
X_test4=np.hstack((cat_0n_xtest, cat_1p_xtest, sub_cat_0n_xtest, sub_cat_1p_xtest, school_state_0n_xtest,
school_state_1p_xtest, teacher_prefix_0n_xtest, teacher_prefix_1p_xtest,
proj_grade_0n_xtest, proj_grade_1p_xtest, price_standardized_xtest,
teacher_num_prev_projects_standardized_xtest, quantity_standardized_xtest,
essay_tfidf_w2v_vectors_xtest, proj_title_tfidf_w2v_vectors_xtest))
print(X_train4.shape, y_train.shape)
print(X_test4.shape, y_test.shape)
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import time
start_time = time.time()
gbdt4 = xgb.XGBClassifier(n_jobs=-1)  # class_weight is not an XGBoost parameter; use scale_pos_weight for imbalance
parameters = {'n_estimators': [5, 10, 50, 100], 'max_depth':[2, 5, 7, 10]}
clfgbdt4 = GridSearchCV(gbdt4, parameters, cv= 3, scoring='roc_auc',return_train_score=True)
clfgbdt4.fit(X_train4, y_train)
train_auc= clfgbdt4.cv_results_['mean_train_score']
train_auc_std= clfgbdt4.cv_results_['std_train_score']
cv_auc = clfgbdt4.cv_results_['mean_test_score']
cv_auc_std= clfgbdt4.cv_results_['std_test_score']
print("Execution time: " + str(time.time() - start_time) + ' seconds')
import dill
dill.dump_session('notebook_71_11.db')
#dill.load_session('notebook_71_11.db')
train_auc = train_auc.reshape(4,4)
train_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(train_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
cv_auc = cv_auc.reshape(4,4)
cv_auc
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns
sns.heatmap(cv_auc,annot=True)
plt.xticks(np.arange(4) + 0.5, [5, 10, 50, 100])  # columns: n_estimators
plt.yticks(np.arange(4) + 0.5, [2, 5, 7, 10])     # rows: max_depth
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
# Testing the performance of the model on test data, plotting ROC Curves
# Select the best hyperparameters found by the grid search
best_set4_xgb = clfgbdt4.best_params_
print(best_set4_xgb)
def batch_predict(clf, data):
    # roc_auc_score(y_true, y_score): the second argument must be probability
    # estimates of the positive class, not the predicted labels
    y_data_pred = []
    tr_loop = data.shape[0] - data.shape[0] % 1000
    # e.g. if X_tr has 49041 rows, tr_loop = 49041 - 49041 % 1000 = 49000
    # iterate over the full batches of 1000 rows
    for i in range(0, tr_loop, 1000):
        y_data_pred.extend(clf.predict_proba(data[i:i+1000])[:, 1])
    # predict for the remaining rows after the last full batch
    y_data_pred.extend(clf.predict_proba(data[tr_loop:])[:, 1])
    return y_data_pred
from sklearn.metrics import roc_curve, auc
gbdt4 = xgb.XGBClassifier(max_depth=5, n_estimators=50, n_jobs=-1)  # class_weight is not an XGBoost parameter
gbdt4.fit(X_train4, y_train)
y_train_pred = batch_predict(gbdt4, X_train4)
y_test_pred = batch_predict(gbdt4, X_test4)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
plt.plot(train_fpr, train_tpr, label="Train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="Test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.grid()
plt.show()
# custom predict function that applies an explicit probability threshold;
# we pick the threshold that maximizes tpr*(1-fpr): high TPR at low FPR
def predict(proba, threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    # tpr*(1-fpr) is largest when fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr) is", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    predictions = []
    for i in proba:
        if i >= t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
print(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
print("Test confusion matrix")
print(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
conf_matr_df_train = pd.DataFrame(confusion_matrix(y_train[:], predict(y_train_pred, tr_thresholds, train_fpr, train_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_train, annot=True,annot_kws={"size": 16}, fmt='g')
conf_matr_df_test = pd.DataFrame(confusion_matrix(y_test, predict(y_test_pred, te_thresholds, test_fpr, test_tpr)))
sns.set(font_scale=1.4)#for label size
sns.heatmap(conf_matr_df_test, annot=True,annot_kws={"size": 16}, fmt='g')
# Compare all the models using the PrettyTable library
from prettytable import PrettyTable
x = PrettyTable()
x.field_names = ["Vectorizer", "Model", "Hyperparameters [max_depth, n_estimators]", "Test AUC"]
x.add_row(["BOW", "RF","[10,500]", 0.70001])
x.add_row(["TFIDF", "RF", "[10,100]", 0.6903])
x.add_row(["AVG W2V", "RF", "[10,10]", 0.61638])
x.add_row(["TFIDF W2V", "RF", "[10,100]", 0.69172])
x.add_row(["BOW", "GBDT","[5,100]", 0.71415])
x.add_row(["TFIDF", "GBDT", "[5,100]", 0.71747])
x.add_row(["AVG W2V", "GBDT", "[5,50]", 0.6825])
x.add_row(["TFIDF W2V", "GBDT", "[5,50]", 0.70865])
print(x)